Welcome to the Festival of Genomics workshop on single-cell analyses

This is an R Markdown document that contains instructions and code for the examples used in todays workshop. The first few steps will check that you have all the packages and data files necessary to carry out all of the analyses.

Hour 1: Getting Started

Sara’s section …

Check that the Brain Atlas data files are present

The following code chunk assumes that Brain Atlas files have been downloaded and placed in the “BrainAtlas” subdirectory of a folder in your home directory entitled “FestivalWorkshopSC”. If you have downloaded these files to another location, either create a new folder and move the files there, or substitute the file path to where they are currently located for “~/FestivalWorkshopSC/BrainAtlas”. If you are using the RStudio Server instance provided by the workshop, change the first line that follows to setwd("/home/FestivalWorkshopSC/BrainAtlas")

setwd("~/FestivalWorkshopSC/BrainAtlas")
file.exists("cell_metadata.csv")
## [1] TRUE
file.exists("genes_counts.csv")
## [1] TRUE
file.exists("ercc_counts.csv")
## [1] TRUE
file.exists("README.txt")
## [1] TRUE

If any of the preceding lines return FALSE, double check that you have set the correct working directory and that all download files have been placed in that folder. If they are missing, you can download the files here. Note that there will be more files than listed here (including alternate gene count quantifications, and metadata about data-driven cluster memberships as discussed in the paper “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics”), but these are the ones we will make use of.

Read the Brain Atlas data files into R

The read.csv function in R is useful for reading in .csv (comma-separated value) files. First, we’ll read in the main data file genes_counts.csv to a data frame using this function and check its contents. The extra arguments to this function help to format our object so that we have row and column names, and we use character variables instead of converting to factors. For more details on these arguments, you can type help(read.csv). In the resulting coutns object, we have genes in rows (24057) and cells in columns (1679).

counts <- read.csv("genes_counts.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(counts[,1:20]) # restrict to the first 20 columns (cells) 
## 'data.frame':    24057 obs. of  20 variables:
##  $ A01101401: num  0 992 2.57 0 0 0 0 0 0 1 ...
##  $ A01101402: num  0 2287 177 0 0 ...
##  $ A01101403: num  0 492 0 0 0 ...
##  $ A01101404: num  0 1932 1 0 0 ...
##  $ A01101405: num  0 1425 2 0 0 ...
##  $ A01101406: num  0 130 3 0 0 ...
##  $ A01101407: num  0 2110 3041 0 17 ...
##  $ A01101408: num  0 955 101 0 0 ...
##  $ A02271433: num  0 326 0 0 0 349 0 0 0 0 ...
##  $ A02271434: num  0 933 1042 0 0 ...
##  $ A02271435: num  0 296 0 0 0 ...
##  $ A02271436: num  0 31 200 0 0 984 0 0 0 36 ...
##  $ A02271437: num  0 970 0 0 3 ...
##  $ A02271438: num  0 594 355 0 228 ...
##  $ A02271439: num  0 2290 3294 0 0 ...
##  $ A02271440: num  0 29 1 0 0 ...
##  $ A12101401: num  0 373 540 0 0 ...
##  $ A12101402: num  0 462 197 0 0 0 0 0 0 0 ...
##  $ A12101403: num  0 516 96 0 498 ...
##  $ A12101404: num  0 1019 0 0 0 ...

The cell_metadata.csv file contains 1679 rows (one for each cell) and columns containing information such as collection date, sequencing type, total reads, mapping percentage, dissection layer, and major/minor derived cell subtypes.

cells <- read.csv("cell_metadata.csv", stringsAsFactors = FALSE, header = TRUE)
str(cells)
## 'data.frame':    1679 obs. of  16 variables:
##  $ long_name         : chr  "A01101401" "A01101402" "A01101403" "A01101404" ...
##  $ cre               : chr  "Calb2" "Calb2" "Calb2" "Calb2" ...
##  $ collection_date   : chr  "11/18/2013" "11/18/2013" "11/18/2013" "11/18/2013" ...
##  $ sequencing_type   : chr  "hiseq" "hiseq" "hiseq" "hiseq" ...
##  $ total_reads       : int  23770190 9694719 5864322 22102121 24057147 24171169 22447919 20995719 9023705 23016406 ...
##  $ all_mapped_percent: num  93.5 92.9 90.5 93.2 93.1 ...
##  $ mRNA_percent      : num  54.4 45.7 48.3 51.4 51.1 ...
##  $ genome_percent    : num  30.1 35.5 34 33.8 32.1 ...
##  $ ercc_percent      : num  4.36 7.84 4.12 4.24 4.98 3.14 3.12 2.27 3.22 2.13 ...
##  $ tdt_permillion    : num  306 341 106 371 264 ...
##  $ major_class       : chr  "Inhibitory" "Inhibitory" "Excitatory" "Inhibitory" ...
##  $ sub_class         : chr  "Vip" "Vip" "L4" "Vip" ...
##  $ major_dissection  : chr  "V1" "V1" "V1" "V1" ...
##  $ layer_dissectoin  : chr  "All" "All" "All" "All" ...
##  $ color_code        : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ short_name        : chr  "A200_V" "A201_V" "A202_V" "A203_V" ...

The ‘ercc_counts.csv’ file contains 1679 columns (one for each cell) and 93 rows containing the counts for the ERCC spike-in control RNA transcripts (and spike in tdTomato). The ERCC spike-ins are a standard set of RNA transcripts that are spiked in at known concentrations to each cell (more on this later). We’ll remove the tdTomato transcript since this does not serve the same purpose as the ERCC spike-ins.

ercc <- read.csv("ercc_counts.csv", stringsAsFactors = FALSE, header = TRUE, row.names=1)
str(ercc[,1:20]) # restrict to the first 20 columns (cells)
## 'data.frame':    93 obs. of  20 variables:
##  $ A01101401: int  167620 16658 35838 18394 0 0 0 0 0 0 ...
##  $ A01101402: int  131187 10413 31563 20370 0 0 0 0 0 578 ...
##  $ A01101403: int  37523 2248 9723 1342 0 0 0 0 0 185 ...
##  $ A01101404: int  136565 4789 34892 8611 0 0 1354 0 1 0 ...
##  $ A01101405: int  222462 15067 53617 19096 0 0 0 0 0 0 ...
##  $ A01101406: int  125448 7055 31013 4439 0 0 0 0 0 0 ...
##  $ A01101407: int  109946 9494 25467 4122 0 0 1 0 0 0 ...
##  $ A01101408: int  79354 5755 17085 9585 0 0 0 0 0 296 ...
##  $ A02271433: int  29555 5180 26581 3004 0 0 0 0 0 304 ...
##  $ A02271434: int  58598 4188 29983 3939 0 0 0 0 0 233 ...
##  $ A02271435: int  18494 836 8201 359 0 0 0 0 0 0 ...
##  $ A02271436: int  10725 764 5968 619 0 0 0 0 0 0 ...
##  $ A02271437: int  48311 1914 18605 1635 0 0 0 0 0 634 ...
##  $ A02271438: int  31437 1279 12595 2102 0 0 0 0 0 332 ...
##  $ A02271439: int  58475 5016 25756 11074 0 0 0 0 0 0 ...
##  $ A02271440: int  127465 6838 67261 2358 0 0 0 0 0 0 ...
##  $ A12101401: int  48418 2363 8437 3996 0 0 0 0 0 119 ...
##  $ A12101402: int  77486 3819 17759 4358 0 521 0 0 0 0 ...
##  $ A12101403: int  72009 4899 13997 4956 0 0 0 0 0 0 ...
##  $ A12101404: int  62729 2947 16044 1551 0 0 0 0 0 0 ...
# remove the tdTomato row
whichTomato <- grep("tdTomato", rownames(ercc))
ercc <- ercc[-whichTomato,]

More detailed information about the files downloaded from the Allen Brain Atlas can be found in the README.txt file provided. Here is a peek at the contents of that file.

# This is a bash command, to be executed at the command line (not within R);
# Alternatively, simply open the README.txt in your favorite text editor to view its contents
head README.txt
## Description of files contained in this data download:
## 1.   genes_counts.csv: File containing read count values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 2.   genes_rpkm.csv: File containing the RPKM values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 3.   ercc_counts.csv: File containing read count values obtained from the RSEM algorithm for each external ERCC spike-in control (row) for each cell (column)
## 4.  cell_metadata.csv: File containing information about each cell profiled, including its nomenclature, Cre line of origin, dissection, date of collection and sequencing, and read mapping statistics
## 5.   cluster_metadata.csv: File containing information about each data-driven cluster, including its label, the corresponding label in Tasic et al. (Nat. Neuro, 2106), the primary cell class membership, and marker genes (including genes with widespread expression in the cluster, sparse expression in the cluster, and no expression in the cluster). 
## 6.   cell_classification.csv: File containing information about the cluster membership of each cell, including whether the cell is a "core" (unambiguously assigned to a single cluster) or "transition" (shares membership between two clusters) cell, as well as its membership score (from 0-10) for each cluster (labeled f01 to f49). 
## 
## To generate the count and RPKM data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013.
## Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585

Check that the desired R packages have been installed

Once a list of desired R packages is finalized, can check that they are installed with

require(scde)         #bioconductor
require(monocle)      #bioconductor
require(scran)        #bioconductor
require(scater)       #bioconductor
require(Biobase)
require(scDD)         #github
require(ggplot2)      #cran
require(devtools)     #cran
require(RColorBrewer) #cran

If any of these commands return a message that includes “there is no package called…”, then the package is missing and needs to be installed. Note that packages may be stored in one of several package repositories. The most popular are Bioconductor, github, and CRAN. For Bioconductor packages, for example edgeR, this can be done with the following code:

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")

For CRAN packages, for example devtools, installation can be done with the following code:

install.packages(devtools)

For Github packages, for example scDD, installation can be done with the following code:

install.packages("devtools")
devtools::install_github("kdkorthauer/scDD")

Visualize major axes of variation in a PCA plot

# extract top 1000 variable genes
gene.var <- apply(counts, 1, function(x) var(log(x[x>0])))
counts.top1000 <- counts[which(rank(-gene.var)<=1000),]

counts.pca <- prcomp(log(counts.top1000+1),
                   center = TRUE,
                   scale. = TRUE) 
summary(counts.pca)$importance[,1:5]
##                             PC1      PC2     PC3      PC4      PC5
## Standard deviation     24.55390 14.19966 7.67291 6.620296 5.769057
## Proportion of Variance  0.35908  0.12009 0.03506 0.026100 0.019820
## Cumulative Proportion   0.35908  0.47917 0.51423 0.540340 0.560160
plot(counts.pca, type="l", main="Top 10 PCs")

color_class <- rainbow(length(unique(cells$major_class)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2], 
      xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$major_class))], pch=20,
      main="PCA plot of cells colored by derived major class")

color_class <- rainbow(length(unique(cells$layer_dissectoin)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2], 
      xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$layer_dissectoin))], pch=20,
      main="PCA plot of cells colored by Dissection Layer")

Hour 2: Normalization and Quality Control of scRNA-seq

Apua’s Section …

Normalization

Quality Control (QC measures)

detectionRate <- apply(counts, 2, function(x) sum(x > 0) / length(x))
hist(detectionRate)

Hour 3: Analysis Modules

Keegan’s Section …

Identify the highly variable genes

In this module, we will identify genes that are highly variable across the entire cell population. This will give us a subset of genes to focus on that is likely enriched for those that are driving heterogeneity among cellular subtypes.

For ease in downstream analysis with various R packages we’ll make use of today, we’ll convert the data.frames that currently (separately) house the counts and cell metadata into a Bioconductor object called an SCESet introduced by the scater package. This object is a container that can hold raw and normalized expression values, along with the metadata for both samples and genes, in one place.

library(scater)
library(scran)
rownames(cells) <- cells$long_name

# construct a SCESet that also contains the gene counts, ercc counts, and metadata
all.equal(colnames(counts), colnames(ercc)) # check that the cells are in the same order across the two datasets
## [1] TRUE
counts.all <- rbind(counts, ercc)  # combine the two into one data.frame
eset <- newSCESet(countData = counts.all, phenoData = AnnotatedDataFrame(cells))
isSpike(eset) <- grepl("^ERCC", rownames(eset))  #designate which rows contain spikeins instead of genes (for HVG analysis)

rm(counts); rm(counts.all) # remove counts and counts.all matrices to free up memory (the counts are now stored in the eset object)

Now that our SCESet has been created, we’ll also perform some basic preprocessing and normalization here to adjust for library size using the pool and deconvolve method in the scran package, as well as remove genes with very low expression (*** these steps may not be necessary if this was already carried out in Apua’s section… Can refer back to the section or move/remove if it is redundant ***).

# QC to compare the level of dropout in endogeneous genes to ERCC spike ins in raw data
eset <- calculateQCMetrics(eset, feature_controls=isSpike(eset))
plotQC(eset, type = "exprs-freq-vs-mean")

Here we apply a couple of filters to remove genes that are expressed at a uniformly low level across the population, and then apply our chosen normalization procedure.

# first, filter out genes that are almost always zero (at least 50 out of 1679 cells must have nonzero expression)
keep <- rowSums(counts(eset) > 0) >= 50
eset <- eset[keep,] 
sum(keep)
## [1] 15341
# next, filter out genes with very low average nonzero expression across all cells (requres a mean of at least 5 counts across all cells)
keep <- rowMeans(counts(eset)) >= 5
eset <- eset[keep,]
sum(keep)
## [1] 13461
# normalize counts for library size using the pool & deconvolve method of Lun et al. (2016)
eset <- computeSumFactors(eset, sizes=c(20, 40, 60, 80))
summary(sizeFactors(eset))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1164  0.7428  0.9664  1.1410  1.2770  8.7680
# use the size factors calculated above to normalize the counts - these get placed in the 'exprs' slot
eset <- normalize.SCESet(eset)

Let’s check the correlation of our size factors with the library size. The strong positive correlation suggests that the size factors calculated by this normalization procedure are primarily adjusting for sequencing depth and overall capture rate.

plot(sizeFactors(eset), colSums(counts(eset))/1e6, log="xy",
    ylab="Library Size (Total Counts in Millions)", xlab="Pooled Size Factor Estimate",
    main="Normalization factor versus library size")

Next, we’ll perform some calculations to help us identify genes that have high variability. To do so, we have to take into account the relationship between mean expression level and variance of expression level. Namely, that what we observe in RNA-seq count data is that genes with higher expression have higher variance (also commonly referred to as ‘over-dispersion’ in the literature). Note that for data on the log-scale, the trend is reverse (variance of log-expression decreases for higher mean log-expression genes). When overdispersion is present, this is also a property of the Negative Binomial distribution, which is often used to model count data from RNA-seq experiments.

We use the trendVar function of the scran package to estimate the relationship between the mean and variance. This function fits a smooth type of curve to capture the overall trend in mean log-expression and variance log-expression, which will serve as an estimate of the baseline technical variability if we assume that the majority of the genes are constantly expressed (i.e. that there is no significant biological variability in the majority of genes). Once this relationship has been estimated, the decomposeVar (also in the scran package) essentially removes the estimated technical variability component from the total variability to calculate the estimated biological variability. This is what we aim to use to find the highly variable genes. We also visualize the mean and variance log-expression relationship, along with the estimated technical variabilty fit.

var.fit <- trendVar(eset, trend="loess", use.spikes=FALSE, span=0.2)
var.out <- decomposeVar(eset, var.fit)

# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)

The dip in the beginning of the plot is due to the very low expressed genes having very low variance since they are dominated by zeroes and very low counts. We can see this if we remake the plot but color the points by the proportion of cells with a zero value.

# function to create a gradient of colors
color.gradient <- function(x, colors=c("orange2", "blue"), colsteps=500) {
  return( colorRampPalette(colors) (colsteps) [ findInterval(x, seq(min(x),max(x), length.out=colsteps)) ] )
}
pzero <- apply(counts(eset), 1, function(x) sum(x > 0) / length(x))

# replot with color by proportion of zero
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression", col=color.gradient(pzero))
colscale <- c(0,0.25,0.5,0.75,1)
legend(11.5,40, title="Proportion cells zero", legend=paste0(1-round(quantile(pzero, colscale),2)),
        col=color.gradient(quantile(pzero, colscale)), pch=16)

Recall the assumption we made in estimating the mean-variance relationship above: there is no significant biological variability in the majority of genes. Is this true in our experiment? How might this assumption be evaluated?

If this assumption is suspect, it is ideal to independently estimate this relationship for so-called ‘spike-in’ data (where a small number of artifical transcripts have been added at known concentrations such that any variability is assured to be technical). Fortunately, in this experiment we have ERCC control spike-ins, so we’ll repeat the steps above to estimate the mean-variance relationship only on the spike-ins. We do this by using the use.spikes=TRUE option in the trendVar function. We also increase the smoothing span a bit since we are using fewer points for estimation.

var.fit.spike <- trendVar(eset, trend="loess", use.spikes=TRUE, span=0.3)
var.out.spike <- decomposeVar(eset, var.fit.spike)

# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out.spike$mean, var.out.spike$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)
o <- order(var.out.spike$mean)
lines(var.out.spike$mean[o], var.out.spike$tech[o], col="red", lwd=2)
points(var.fit.spike$mean, var.fit.spike$var, col="red", pch=16)

How similar are the two trend estimations? What does this suggest about the assumption that most genes do not have significant biological variability?

Finally, we extract the top 1000 genes with highest biological variability to use in downstream analyses. We’ll also examine the distributions of the top 25 highly variable genes. Here we will use the estimate of biological variability we obtained by fitting the spike-ins above. However, note that spike-ins are not always available, they can be challenging to incorporate into the protocol, and won’t necessarily capture all of the technical variability depending on where in the protocol they are added.

# extract and examine the top 1000 genes by biological variance
top.hvg <- order(var.out.spike$bio, decreasing=TRUE)[1:1000]
head(var.out.spike[top.hvg,])
##            mean    total      bio      tech
## Gad1   6.587978 45.91174 31.96631 13.945424
## Slc6a1 7.176197 42.18583 29.90373 12.282101
## Cnr1   8.608585 32.69374 24.90063  7.793106
## Enc1   9.133358 29.08431 22.97221  6.112099
## Rgs4   8.258177 30.38597 21.49162  8.894354
## Synpr  5.588791 36.68112 20.72181 15.959308
# construct a new eset object that only contains the highly variable genes for downstream analysis
eset.hvg <- eset[top.hvg,]

# plot distribution of the top 25 highly variable genes
top25 <- top.hvg[1:25]
boxplot(t(exprs(eset)[top25,]), las=2, ylab="Normalized log-expression", col="dodgerblue", main="Top 25 Highly Variable Genes")

Note that the biological variability estimates provided by seem to be rather robust to outlier cells, as none of the genes with highest variability seem to be driven by outliers.

Let’s also look again at the mean log-expression versus variance log-expression plot we generated above, but this time highlight the locations of top 1000 and top 25 highly variable genes.

# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
o <- order(var.out.spike$mean)
lines(var.out.spike$mean[o], var.out.spike$tech[o], col="red", lwd=2)
points(var.out$mean[top.hvg], var.out$total[top.hvg], col="darkorange2", cex=0.6, pch=16)
points(var.out$mean[top25], var.out$total[top25], col="red4", cex=0.7, pch=15)
legend(11.5, 40, legend=c("Top 25 HVGs", "Top 1000 HVGs"), pch = c(15, 16), 
       col= c("red4", "darkorange2"), cex=1.1)

Finally, we’ll create a heatmap plot to visualize how the overall expression of the top 25 HVGs are associated with the major cell classes.

# extract matrix of hvg expression for plotting
m <- exprs(eset)[top25,]

# plot heatmap 
library(RColorBrewer)
heatmap(m/apply(m,1,max),zlim=c(0,1),labCol=NA, col=brewer.pal(9,"YlOrRd"),
        scale="none",ColSideColors=rainbow(5)[as.numeric(factor(phenoData(eset)$major_class))])
par(lend = 1)           # square line ends for the color legend
legend("topleft", inset=-0.04,
    legend = unique(phenoData(eset)$major_class),
    col = rainbow(5)[as.numeric(as.factor(unique(phenoData(eset)$major_class)))],
    lty= 1, lwd = 5, cex=0.6
)

Identify differentially expressed genes

In this module we will identify differentially expressed genes between neuronal subtypes, as well as between neurons and non-neuronal cell types. As we have learned in this workshop, though scRNA-seq data shares many characteristics with bulk RNA-seq data, there are major differences which need to be accommodated in order to properly analyze it and fully exploit its advantages (namely the presence of dropouts, or zeroes, and increased heterogeneity from a combination of technical and biological sources). First we will explore the method SCDE which compares expression magnitude between groups of cells while adjusting for drop-out and amplification biases in expression magnitude by fitting error models for individual cells.

After loading the SCDE package, we’ll create a vector of indices to indicate which cells we will analyze. In the interest of minimizing computation time during this workshop, we’ll only assess the top 1000 highly variable genes for differential expression (these are contained in the eset.hvg SCESet object). In addition, some of the SCDE output has been precomputed for you since the method is computationally intensive with a large number of cells (over 1000). To complete the rest of the module, you’ll need to download this results object from GitHub if you are not using one of the server instances (where it will be preloaded) - see the instructions below.

We’ll first create a vector that indicates which of the cells have been classified as neurons. This will be used to subset the SCESet to contain only these cells. Then we’ll reduce this indicator vector to only contain a random sample of 100 cells of each neuron subtype. We’ll also create a group factor variable that indicates which type of neuron each cell is classified as (inhibitory or exitatory). Finally, since SCDE requires raw (unnormalized) integer count data as input, we’ll extract the raw counts matrix from the subsetted SCESet object and change their class from numeric to integer.

library(scde)

# find DE genes between excitatory and inhibitory neuronal subtypes
# first, find which cells are neuron (inhibitory or excitatory)
which.neur <- which(pData(eset.hvg)$major_class %in% c("Inhibitory", "Excitatory"))

# next, construct the group factor labeling for neuron subtype using the index of neuron cells `which.neur' to subset the SCESet object
group <- factor(pData(eset.hvg[,which.neur])$major_class)

# re-create the group factor for the subsampled cell indices
group <- factor(pData(eset.hvg[,which.neur])$major_class)
names(group) <- sampleNames(eset.hvg[,which.neur])

# save counts matrix as integer class; note SCDE requires raw (unnormalized) counts as input. 
cts <- apply(counts(eset.hvg[,which.neur]),2,function(x) {storage.mode(x) <- 'integer'; x}) 

With the SCDE inputs prepared, we’re ready to fit the SCDE models for these genes. To do so, we first use the scde.error.models function to fit the cell-specific error models. These error models are then used as input to fit the models for the prior distributions of gene expression magnitude using the scde.expression.prior function. Then, both the error models and prior are used as input to the main function scde.expression.difference function, which performs a test of differential expression for each gene. The output returns a table that contains the value of the test statistic for each gene, so we’ll carry out an additional step to obtain the p-values that correspond to the test statistics. Note that the columns starting with a ‘c’ correspond to those that have been corrected for multiple comparisons.

Note that these model-fitting step can take quite some time, since an individual error model is fit for each cell in the dataset, and pairs of cells are compared to one another in order to estimate the parameters of the fit. For the purposes of this workshop, the error models object has been precomputed (If you are using the RStudio Server instance, simply load the scde.results.RData file. If you are using your own machine, you must first download this object from the GitHub page for this vignette - https://raw.githubusercontent.com/kdkorthauer/FestivalWorkshopVignettes/master/scde.results.RData and then save it to your R sessions current working directory).

# fit error models using the scde.error.models function (the following steps have already been run in the interest of time - it is computationally intensive)
########## err.mod <- scde.error.models(counts = cts, groups = group, n.cores = 10, min.nonfailed = 30,
##########                                   verbose=1, save.crossfit.plots=FALSE, threshold.segmentation = TRUE,
##########                         min.size.entries = 500, save.model.plots = FALSE, linear.fit=FALSE,
##########                         min.pairs.per.cell=20, max.pairs=10000)

# estimate the prior for gene expression 
########## prior.mod <- scde.expression.prior(models = err.mod, counts = cts, length.out = 400, show.plot = FALSE, max.quantile=0.9999)

# test for differential expression for all genes using the scde.expression.difference function
########## exp.diff <- scde.expression.difference(models = err.mod, counts = cts, prior = prior.mod, groups = group, n.cores = 1)

Again, the preceding steps are commented out since they have already been run for you. Instead, load the results object that contains err.mod, prior.mod, and exp.diff. We’ll add a column to the main results table that displays the p-values associated with the test statistics (both raw and multiplicity-adjusted).

# load the pre-computed results objects err.mod, prior.mod, and exp.diff (all contained
# in the file "scde.results.RData").  Assumes the file is saved to the current working
# directory
load("scde.results.RData")

# view the top of the main results table
head(exp.diff)
##                 lb        mle         ub          ce         Z        cZ
## Gad1    0.58992655  0.9075793  1.1798531  0.58992655  7.153782  6.504284
## Slc6a1  0.45378966  0.8168214  1.1344741  0.45378966  5.496958  4.860499
## Cnr1    0.04537897  0.7260634  1.3613690  0.04537897  2.090673  1.461324
## Enc1   -0.86220035 -0.4991686 -0.1815159 -0.18151586 -2.868074 -2.220649
## Rgs4   -1.04371621 -0.7260634 -0.4537897 -0.45378966 -5.557564 -4.922400
## Synpr   0.77144242  1.1798531  1.6336428  0.77144242  7.151443  6.504284
# add a column with p-values and multiplicity-corrected pvalues
exp.diff$Pval <- 2*pnorm(-abs(exp.diff$Z))
exp.diff$cPval <- p.adjust(exp.diff$Pval, method="fdr")

We can explore these results to see how many genes are differentially expressed at the 0.05 level after adjusting for mutliple comparisons, and save these results in a .csv file (viewable in Microsoft Excel or a simple text editor).

# count how many genes have an fdr-adjusted p-value less than 0.05
sum(exp.diff$cPval < 0.05)
## [1] 184
# reorder the genes by pvalue
exp.diff <- exp.diff[order(exp.diff$cPval),]
head(exp.diff)
##                 lb       mle       ub        ce        Z       cZ
## Gad1     0.5899266 0.9075793 1.179853 0.5899266 7.153782 6.504284
## Synpr    0.7714424 1.1798531 1.633643 0.7714424 7.151443 6.504284
## Zcchc12  0.7714424 1.1798531 1.542885 0.7714424 7.160813 6.504284
## Serpini1 0.6806845 0.9529583 1.225232 0.6806845 7.160813 6.504284
## Rab3b    0.8622003 1.2252321 1.497506 0.8622003 7.160813 6.504284
## Sst      1.2706110 1.9512955 2.768117 1.2706110 7.160813 6.504284
##                  Pval        cPval
## Gad1     8.441938e-13 7.806413e-11
## Synpr    8.587054e-13 7.806413e-11
## Zcchc12  8.020003e-13 7.806413e-11
## Serpini1 8.020006e-13 7.806413e-11
## Rab3b    8.020000e-13 7.806413e-11
## Sst      8.020000e-13 7.806413e-11
# create a .csv file that lists all the DE genes with fdr-adjusted p-value less than 0.05
write.csv(exp.diff[exp.diff$cPval < 0.05, ], file = "scdeResults_DEgenes.csv", row.names = TRUE, quote = FALSE)

To get more insight into the models fit by SCDE, we can use the related scde.test.expression.difference function to visualize the results for a particular gene. For example, we can view the cell-specific posterior distributions for the two different neuronal subtypes for a DE gene:

# visualize the results for a particular DE gene
scde.test.gene.expression.difference("Gad1", models = err.mod, counts = cts, prior = prior.mod)

##             lb       mle        ub        ce         Z        cZ
## Gad1 -10.48254 -10.02875 -9.620341 -9.620341 -7.160847 -7.160847

In contrast, we can look at another gene that is not differentially expressed and see that the cell-specific posterior distributions do not look globally different between excitatory and inhibitory neurons.

scde.test.gene.expression.difference("Rgs17", models = err.mod, counts = cts, prior = prior.mod)

##               lb        mle        ub ce         Z        cZ
## Rgs17 -0.1361369 0.09075793 0.3176528  0 0.6914242 0.6914242
# reset plot window
dev.off()
## null device 
##           1

Finally, we’ll create a heatmap plot to visualize how the overall expression of differentially expressed genes found by SCDE are associated with neuronal subtype.

# extract matrix of hvg expression for plotting
m <- exprs(eset)[rownames(eset) %in% rownames(exp.diff[exp.diff$cPval < 0.05,]), colnames(eset) %in% names(group)]

# plot heatmap 
library(RColorBrewer)
heatmap(m/apply(m,1,max),zlim=c(0,1),labCol=NA, col=brewer.pal(9,"YlOrRd"),
        scale="none",ColSideColors=rainbow(2)[as.numeric(as.factor(group))])
par(lend = 1)           # square line ends for the color legend
legend("topleft", inset=-0.04,
    legend = unique(group),
    col = rainbow(2)[as.numeric(as.factor(unique(group)))],
    lty= 1, lwd = 5, cex=0.6
)

Next we’ll explore the method scDD which also compares expression between groups of cells, but aims to detect differences in expression patterns that may be more complex than an overall magnitude change (or mean shift). Specifically, cells within each group may be captured at different expression states, as may happen if a gene is oscillatory, or if it exhibits stochastic, “bursty” expression dynamics. scDD can detect if these types of complex patterns differ across cell groups.

Our aim is two-fold: (1) to detect which genes have different expression distributions between the two neuronal subtypes and (2) to classify those differences into informative patterns. Note that in (1) we explicitly say differences in ’distributions’ rather than differences in ’average’, which would correspond to traditional DE (differential expression) analysis in bulk RNA-seq. By examining the entire distribution, we are able to detect more subtle differences as well as describe complex patterns, such as the existence of subgroups of cells within and across condition that express a given gene at a different level.

First, we construct an ExpressionSet object that contains our count data and cell metadata for input to scDD. Note that this is very similar to the SCESet object class used by scater; that class is based on the ExpressionSet class. We’ll reuse the group and which.neur neuronal subtype vector since they already contain the subset cells we wish to analyze. Note that scDD expects normalized counts, so we extract them from the eset.hvg object using the exprs function. We’ll also create a list of the hyperparameters to use in model-fitting.

library(scDD)
library(Biobase)

# construct object to send to scDD; we'll reuse group object created for SCDE analysis and create a new (raw-scale) normalized counts matrix
condition <- as.numeric(as.factor(group))
names(condition) <- names(group)
cts <- exp(exprs(eset.hvg[,which.neur])) # scDD requires normalized counts on the raw scale (not logged)
cts <- cts - min(cts)                    # scran added a pseudocount after normalization; need to keep zero counts as zeroes
cts[cts < 1e-6] <- 0                     # account for floating point error in estimating psuedocount
eset.scdd <- ExpressionSet(assayData=cts,
                     phenoData=as(data.frame(condition), "AnnotatedDataFrame"))
rm(cts)

# Create list of prior parameters for model fitting 
prior_param=list(alpha=0.01, mu0=0, s0=0.01, a0=0.01, b0=0.01)

Next, we’re ready to use the main function scDD to detect and classify differentially distributed genes. In the interest of time, we avoid the computationally intensive permutation testing step and instead use the Kolmogorov-Smirnov test.

# find DE genes between excitatory and inhibitory neuronal subtypes
dd.results <- scDD(eset.scdd, prior_param=prior_param, permutations=0, testZeroes=FALSE)
## [1] "Clustering observed expression data for each gene"
## Notice! Number of permutations is set to zero; using Kolmogorov-Smirnov to test for differences in distributions instead of the Bayes Factor permutation test
## [1] "Classifying significant genes into patterns"
head(dd.results$Genes)
##          gene nonzero.pvalue nonzero.pvalue.adj zero.pvalue
## Gad1     Gad1              0                  0          NA
## Slc6a1 Slc6a1              0                  0          NA
## Cnr1     Cnr1              0                  0          NA
## Enc1     Enc1              0                  0          NA
## Rgs4     Rgs4              0                  0          NA
## Synpr   Synpr              0                  0          NA
##        zero.pvalue.adj DDcategory Clusters.combined Clusters.c1
## Gad1                NA         DE                 2           1
## Slc6a1              NA         DE                 1           1
## Cnr1                NA         NC                 2           2
## Enc1                NA         DB                 2           1
## Rgs4                NA         DB                 3           1
## Synpr               NA         NC                 2           2
##        Clusters.c2
## Gad1             1
## Slc6a1           1
## Cnr1             2
## Enc1             2
## Rgs4             2
## Synpr            2
# how many genes were significantly differentially distributed?
sum(dd.results$Genes$nonzero.pvalue.adj < 0.05)
## [1] 828
# what categories to the DD genes belong to?
table(dd.results$Genes$DDcategory)
## 
##  DB  DE  DM  DP  NC  NS 
## 151 317  81  35 249 167

We see that the majority of these HVGs are differentially distributed. Note that we expect to detect more differential genes than with SCDE, since scDD can also detect changes that are more subtle and complex than an overall mean shift.

Let’s take a look at the distributions of the two genes we plotted for SCDE. The Gad1 gene was detected as significantly differentially distributed, and was categorized as “Differentially Expressed” (difference in overall mean expression between the cell types). In these plots, the cell type labels have been converted to numeric, but 1=Excitatory and 2=Inhibitory.

dd.results$Genes[dd.results$Genes$gene == "Gad1",]
##      gene nonzero.pvalue nonzero.pvalue.adj zero.pvalue zero.pvalue.adj
## Gad1 Gad1              0                  0          NA              NA
##      DDcategory Clusters.combined Clusters.c1 Clusters.c2
## Gad1         DE                 2           1           1
sideViolin(exprs(eset.scdd)[rownames(eset.scdd) ==  "Gad1",], phenoData(eset.scdd)$condition,
            title.gene="Gad1")
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

The Rgs17 gene was not differnetially distributed.

dd.results$Genes[dd.results$Genes$gene == "Rgs17",]
##        gene nonzero.pvalue nonzero.pvalue.adj zero.pvalue zero.pvalue.adj
## Rgs17 Rgs17      0.2528578           0.274547          NA              NA
##       DDcategory Clusters.combined Clusters.c1 Clusters.c2
## Rgs17         NS                 2           1           2
sideViolin(exprs(eset.scdd)[rownames(eset.scdd) ==  "Rgs17",], phenoData(eset.scdd)$condition,
            title.gene="Rgs17")
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

And to illustrate some of the more complex distributional patterns, we’ll take a look at some more differentially distributed genes. The Vip gene was classified as having “differential modality”, meaning that two subgroups of cells were detected in the Inhibitory cells and only one of these was detected in the Excitatory cells.

dd.results$Genes[dd.results$Genes$gene == "Vip",]
##     gene nonzero.pvalue nonzero.pvalue.adj zero.pvalue zero.pvalue.adj
## Vip  Vip              0                  0          NA              NA
##     DDcategory Clusters.combined Clusters.c1 Clusters.c2
## Vip         DM                 2           1           2
sideViolin(exprs(eset.scdd)[rownames(eset.scdd) ==  "Vip",], phenoData(eset.scdd)$condition,
            title.gene="Vip")
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

The Pla2g7 gene was classified as having multiple underlying levels of expression in each cell type (multi-modality), however there were a “differential proportion” of excitatory and inhibitory cells belonging to each expression level. Specifically, more of the inhibitory cells expressed Pla2g7 at the high expression mode.

dd.results$Genes[dd.results$Genes$gene == "Pla2g7",]
##          gene nonzero.pvalue nonzero.pvalue.adj zero.pvalue
## Pla2g7 Pla2g7              0                  0          NA
##        zero.pvalue.adj DDcategory Clusters.combined Clusters.c1
## Pla2g7              NA         DP                 2           2
##        Clusters.c2
## Pla2g7           2
sideViolin(exprs(eset.scdd)[rownames(eset.scdd) ==  "Pla2g7",], phenoData(eset.scdd)$condition,
            title.gene="Pla2g7")
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

Finally, the Scg2 gene was classified as having both “differential modality” and different overal mean expression level. Specifically, two subgroups of cells were detected in the excitatory cells and one in the inhibitory, but none of the subgroups overlapped between the cell types.

dd.results$Genes[dd.results$Genes$gene == "Scg2",]
##      gene nonzero.pvalue nonzero.pvalue.adj zero.pvalue zero.pvalue.adj
## Scg2 Scg2              0                  0          NA              NA
##      DDcategory Clusters.combined Clusters.c1 Clusters.c2
## Scg2         DB                 1           2           1
sideViolin(exprs(eset.scdd)[rownames(eset.scdd) ==  "Scg2",], phenoData(eset.scdd)$condition,
            title.gene="Scg2")
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.

Lastly, we’ll create a heatmap plot to visualize how the overall expression of differentially distributed genes found by scDD are associated with neuronal subtype. We’ll look at just the “differential modality” genes here.

# extract matrix of hvg expression for plotting
DMgenes <- dd.results$Genes$gene[dd.results$Genes$DDcategory == "DM"] 
m <- exprs(eset)[rownames(eset) %in% DMgenes, colnames(eset) %in% names(group)]

# plot heatmap 
library(RColorBrewer)
heatmap(m/apply(m,1,max),zlim=c(0,1),labCol=NA, col=brewer.pal(9,"YlOrRd"),
        scale="none",ColSideColors=rainbow(2)[as.numeric(as.factor(group))])
par(lend = 1)           # square line ends for the color legend
legend("topleft", inset=-0.04,
    legend = unique(group),
    col = rainbow(2)[as.numeric(as.factor(unique(group)))],
    lty= 1, lwd = 5, cex=0.6
)

Order cells by “Psuedotime” (temporal-spatial variation)

Just like we saw in the Highly Variable Genes module, it can be convenient to store all information about an experiment (measurements and metadata) in one R object. This is a recurring theme in the Bioconductor project, so many R packages on Bioconductor rely on constructs like these. The benefits are gains in efficiency and reduction of errors in associating metadata objects back to the measurement objects, and often many package developers use objects that are compatible across several methods. However, with the explosion in methods development for scRNA-seq analysis, we have not yet established a ‘standard’ object type for storing and analysing scRNA-seq data in R. The result is that packages developed simulataneously and independently have slightly different formatting requirements. So long story short, we have to convert our SCESet object from the scater package to a similar CellDataSet object for use with the monocle package. Don’t worry about the details here.

Our goal in this analysis module is to discover a latent trajectory of variation using Monocle which may represent spatial organization of the cells. Since it is recommended to use genes that are believed to be important in determining where cells are located in relation to eachother (the ‘ordering’), we use the dataset that only contains the top 2000 highly variable genes that we constructed a previous analysis module.

library(monocle)
# construct a CellDataSet object with our SCESet object that contains only the top 2000 highly variable genes
cset <- newCellDataSet(cellData = exprs(eset.hvg), phenoData = phenoData(eset.hvg))
class(cset)
## [1] "CellDataSet"
## attr(,"package")
## [1] "monocle"

The setOrderingFilter function requires that we select which genes to use in the ordering. Since we have already filtered our cset dataset to include only the top 2000 highly variable genes, ordered by biological variance, we can subset the top 100 genes to include the top 100 highly variable genes. Before applying Monocle’s ordering algorithm, we use the reduceDimensions function to perform dimension reduction on this set of 2000 genes. Finally, the orderCells function carries out the Monocle algorithm. Note that we need to specify the number of paths here, which represent the number of main cell fates (or states) believed to be present. Here we choose one, which represents the path along the cortical layers as linear. Note that this step can take several minutes to complete.

Since we have already demonstrated that major cell type (inhibitory neuron, excitatory neuron, and non-neuronal cell) is a major source of variation, in this module we will examine the trajectories within the major neuronal cell types. To do so, we create two separate CellDataSets by subsetting on the major class listed in the phenotypic metadata and carrying out the Monocle algorithm separately on both. We also subset on the cells that have a major labeled dissection layer other than “All” (lower versus upper in Inhibitory neurons; L1-L6 in Excitatory neurons).

options(expressions = 500000) # 'under-the-hood' option; need to execute if using OSX or Windows due to a limitation on C stack size

# Run Monocle on subset of Inhibitory neurons
cset.inhibitory <- cset[,phenoData(cset)$major_class=="Inhibitory" & 
                         phenoData(cset)$layer_dissectoin %in% c("lower", "upper")]
cset.inhibitory <- setOrderingFilter(cset.inhibitory, ordering_genes=rownames(cset.inhibitory)[1:100])
cset.inhibitory <- reduceDimension(cset.inhibitory, use_irlba = FALSE) # Reduce dimensionality
## Reducing to independent components
cset.inhibitory <- orderCells(cset.inhibitory, num_paths = 1, reverse = FALSE) # Order cells

# Run Monocle on subset of Excitatory neurons
cset.excitatory <- cset[,phenoData(cset)$major_class=="Excitatory" & 
                         phenoData(cset)$layer_dissectoin %in% c("L1", "L2/3", "L4", "L5", "L6", "L6a", "L6b")]
cset.excitatory <- setOrderingFilter(cset.excitatory, ordering_genes=rownames(cset.excitatory)[1:100])
cset.excitatory <- reduceDimension(cset.excitatory, use_irlba = FALSE) # Reduce dimensionality
## Reducing to independent components
cset.excitatory <- orderCells(cset.excitatory, num_paths = 1, reverse = FALSE) # Order cells

Next we plot the spanning tree to visualize the cell ordering projected on the first two components of variation. We color the cells in this space by dissection layer see if we have recovered any of the spatial structure of the cortical layers. We also color the cells by Cre reporter line. Did we recover any spatial organization in these two subsets? Does there seem to be variation by Cre reporter?

# plotting by various factors
plot_spanning_tree(cset.inhibitory, color_by="layer_dissectoin") # plot spanning tree

plot_spanning_tree(cset.inhibitory, color_by = "cre")

# Excitatory
plot_spanning_tree(cset.excitatory, color_by="layer_dissectoin") # plot spanning tree

plot_spanning_tree(cset.excitatory, color_by="cre") # plot spanning tree

How does the spanning tree change if we allow a different number of paths? How does the spanning tree change if we include different gene sets?

R Session information:

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] stats4    splines   parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2     devtools_1.12.0        scDD_1.1.0            
##  [4] scran_1.0.4            scater_1.0.4           BiocParallel_1.6.6    
##  [7] monocle_1.6.2          plyr_1.8.4             igraph_1.0.1          
## [10] VGAM_1.0-2             ggplot2_2.1.0          Biobase_2.32.0        
## [13] BiocGenerics_0.18.0    HSMMSingleCell_0.106.2 scde_2.0.1            
## [16] flexmix_2.3-13         lattice_0.20-33       
## 
## loaded via a namespace (and not attached):
##  [1] minqa_1.2.4               colorspace_1.2-6         
##  [3] rjson_0.2.15              modeltools_0.2-21        
##  [5] mclust_5.2                dynamicTreeCut_1.63-1    
##  [7] RcppArmadillo_0.7.400.2.0 MatrixModels_0.4-1       
##  [9] AnnotationDbi_1.34.4      extRemes_2.0-8           
## [11] Lmoments_1.2-3            tximport_1.0.3           
## [13] knitr_1.14                spam_1.4-0               
## [15] EBSeq_1.12.0              nloptr_1.0.4             
## [17] Cairo_1.5-9               pbkrtest_0.4-6           
## [19] cluster_2.0.4             shinydashboard_0.5.1     
## [21] shiny_0.13.2              assertthat_0.1           
## [23] Matrix_1.2-7.1            limma_3.28.21            
## [25] formatR_1.4               htmltools_0.3.5          
## [27] quantreg_5.29             tools_3.3.1              
## [29] blockmodeling_0.1.8       coda_0.18-1              
## [31] gtable_0.2.0              reshape2_1.4.1           
## [33] dplyr_0.5.0               maps_3.1.1               
## [35] Rcpp_0.12.7               gdata_2.17.0             
## [37] nlme_3.1-128              stringr_1.1.0            
## [39] testthat_1.0.2            lme4_1.1-12              
## [41] mime_0.5                  irlba_2.1.1              
## [43] gtools_3.5.0              XML_3.98-1.4             
## [45] distillery_1.0-2          edgeR_3.14.0             
## [47] zlibbioc_1.18.0           MASS_7.3-45              
## [49] zoo_1.7-13                scales_0.4.0             
## [51] pcaMethods_1.64.0         rhdf5_2.16.0             
## [53] SparseM_1.7               fields_8.4-1             
## [55] yaml_2.1.13               memoise_1.0.0            
## [57] gridExtra_2.2.1           biomaRt_2.28.0           
## [59] fastICA_1.2-0             stringi_1.1.1            
## [61] RSQLite_1.0.0             Rook_1.1-1               
## [63] S4Vectors_0.10.3          caTools_1.17.1           
## [65] chron_2.3-47              matrixStats_0.50.2       
## [67] bitops_1.0-6              RMTstat_0.3              
## [69] arm_1.9-1                 evaluate_0.9             
## [71] labeling_0.3              cowplot_0.6.2            
## [73] magrittr_1.5              R6_2.1.3                 
## [75] IRanges_2.6.1             gplots_3.0.1             
## [77] combinat_0.0-8            DBI_0.5                  
## [79] withr_1.0.2               mgcv_1.8-14              
## [81] abind_1.4-5               RCurl_1.95-4.8           
## [83] nnet_7.3-12               tibble_1.2               
## [85] crayon_1.3.2              car_2.1-3                
## [87] KernSmooth_2.23-15        rmarkdown_1.0            
## [89] viridis_0.3.4             grid_3.3.1               
## [91] data.table_1.9.6          digest_0.6.10            
## [93] xtable_1.8-2              httpuv_1.3.3             
## [95] brew_1.0-6                outliers_0.14            
## [97] munsell_0.4.3